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Abstract 

In this paper, we study the alternating direction method for finding the Dantzig selectors, 
which are first introduced in [5]. In particular, at each iteration we apply the nonmonotone 
gradient method proposed in [17] to approximately solve one subproblem of this method. We 
compare our approach with a first-order method proposed in [3]. The computational results 
show that our approach usually outperforms that method in terms of CPU time while producing 
solutions of comparable quality. 

Key words: Dantzig selector, alternating direction method, nonomonotone line search, gradient 
method. 

1 Introduction 

Consider the standard linear regression model: 

y = xp + e, (1) 

where y £ is a vector of responses, X £ ?R. nxp is a design matrix, /3 £ W is an unknown regression 
vector and e is a vector of random noises. One widely studied problem for this model is that of 
variable selection, that is, how to determine the support of /? (i.e., the indices of the nonzero entries 
of f3). When p <C n, this problem can be tackled by many classical approaches. In recent years, 
however, the situations where p S> n have become increasingly common in many applications such 
as signal processing and gene expression studies. Thus, efforts have been directed at developing new 
variable selection methods that work for large values of p. A few examples of such methods include 
the lasso [23J, the elastic net [28J, and the more recent Dantzig selector [8]. 

A Dantzig selector for ([1]) is a solution of the following optimization problem: 



mm mi 

s.t. ||Z)- 1 A T (A/3-y)|| 00 <S, 
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where 5 > and D is the diagonal matrix whose diagonal entries are the norm of the columns of 
X. The Dantzig selector was first proposed in [8] and justified on detailed statistical grounds. In 
particular, it was shown that, this estimator achieves a loss within a logarithmic factor of the ideal 
mean squared error, i.e., the error one would achieve if one knows the support of j3 and the coordinates 
of p that exceed the noise level. For more discussion of the importance of Dantzig selector and its 
relationship with other estimators like lasso, we refer the readers to El El [Til E3 OS EH El- 

Despite the importance of the Dantzig selector and its many connections with other estimators, 
there are very few existing algorithms for solving (J2J) . One natural way of solving (J2]) is to recast it as 
a linear programming (LP) problem and solve it using LP techniques. This approach is adopted in 
the package £i-magic j7], which solves the resulting LP problem via a primal-dual interior-point (IP) 
method. However, the IP methods are typically not efficient for large-scale problems as they require 
solving dense Newton systems for each iteration. Another approach of solving ([2]) uses homotopy 
methods to compute the entire solution path of the Dantzig selector (see, for example, [22 [ I14j). 
Nevertheless, as discussed in Section 1.2], these methods are also unable to deal with large-scale 
problems. Recently, first-order methods are proposed for ([2]) in |16[ [3]. which are capable of solving 
large-scale problems. In |16| . problem ([2]) and its dual are recast into a smooth convex programming 
problem and an optimal first-order method proposed in [2] is then applied to solve the resulting 
problem. In [3], problem ([2]) is recast as a linear cone programming problem. The optimal first-order 
methods (see, for example, [HI El [201 EU E] ) are then applied to solve a smooth approximation to 
the dual of the latter problem. 

In this paper, we consider an alternative approach, namely, the alternating direction method 
(ADM), for solving ([2]). The ADM and its many variants have recently been widely used to solve 
large-scale problems in compressed sensing, image processing and statistics (see, for example, [JJ [12l 
[25 l 127 1 [26]). Ln general, the ADM can be applied to solve problems of the following form: 

min f(x)+g(y) 

x,y 

s.t. Ax + By = b, ( 3 ) 
x G Ci,y G C 2 , 

where / and g are convex functions, A and B are matrices, b is a vector, and C\ and C*2 are closed 
convex sets. Each iteration of the ADM involves solving two subproblems successively and then 
updating a multiplier, and the method converges to an optimal solution of ([3]) under some mild 
assumptions (see, for example, [H QJj]). In this paper, we show that ([2]) can be rewritten in the 
form of ([3]), and hence the ADM can be suitably applied. Moreover, we show that one of the ADM 
subproblems has a simple closed form solution, while another one can be efficiently and approximately 
solved by a nonmonotone gradient method proposed recently in [17] . We also discuss convergence 
of this ADM. Finally, we compare our method for solving ([2]) with a first-order method proposed 
in [3] on large-scale simulated problems. The computational results show that our approach usually 
outperforms that method in terms of CPU time while producing solutions of comparable quality. 

The rest of the paper is organized as follows. In Subsection ll.il we define notations used in this 
paper. In Section El we study the alternating direction method for solving problem ([2]) and address 
its convergence. Finally, we conduct numerical experiments to compare our method with a first-order 
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method proposed in [3] in Section [3l 



1.1 Notations 

In this paper, 3f? n denotes the n-dimensional Euclidean space and ffi nxn denotes the set of all m x n 
matrices with real entries. For a vector x £ $t n , \\x\\i, \\x\\2 and ||x||oo denote the 1-norm, 2-norm 
and oo-norm of x, respectively. For any vector x in K n , |x| is the vector whose ith entry is \xi\, 
while sgn(x) is the vector whose ith entry is 1 if Xi > and —1 otherwise. Given two vectors x 
and y in ffl 1 , x o y denotes the Hadamard (entry-wise) product of x and y, max{x,y} denotes the 
vector whose ith entry is max{xj,yj}. The letter e denotes the vector of all ones, whose dimension 
should be clear from the context. Finally, given a scalar a, [a]+ denotes the positive part of a, that 
is, [a] + = max{0, a}. 

2 Alternating direction method 

In this section, we study the ADM for solving ([2]) and discuss its convergence and implementation 
details. 

In order to apply the ADM, we first rewrite ([2]) in the form of ([3]). To this end, we introduce a 
new variable z and rewrite ([2]) as follows: 



Then it is easy to see that is in the form of © with A = X T X, B = —I, c = X T y, d = W, 
C2 = {z : \\D~ 1 z\\ oc < 5}, f = || • ||i and g = 0. Next, in order to describe the ADM iterations, we 
introduce the following augmented Lagrangian function for problem @: 



for some \x > 0. Each iteration of the ADM involves alternate minimization of with respect to 
z and (3, followed by an update of A. The standard ADM for problem (j3|) (or, equivalently, ([2])) is 
described as follows: 

Alternating direction method: 



mm ||/3||i 

s.t. X T (XP-y)-z = 0, 
llD-^Hoo < S. 



(4) 



A) = \\f3\\i + X T (X T Xf3-X T y-z) + ^\\X T Xf3-X T y-z 



1. Start: Let /3°, A G K p and fj, > be given. 



2. For k = 0, 1, . . . 



' z k+1 = argmin L^(z, f3 k , X k ), 



< (3 k+1 £ Argmin L^{z k+1 , (3, A fc ), 



(5) 



^ A fc+i = A fe + ii(X T X j3 k+l - X T y - z k+l ). 



End (for) 
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Before discussing the convergence of the above method, we first derive the dual problem of @ 
(or, equivalently, fl2J)). Note that 

v = min{]|/3[|i : X T (X(3 - y) - z = 0, < $} 



minmax{||/3||i + X 1 {X 1 Xfi - X 1 y - z) : ||£> -1 «||oo < $) 

z,f3 A 

maxmin{||/3||i + X T (X T Xfi - X T y - z) : H-D^lloo < 5} 

A z,/3 

T v \ illnMI. . II vT ■ 



= max{-y J XA-(5||Z)A||i : \\X+ XXW^ < 1}, 

A 

where the third equality holds by strong duality. Thus, the dual problem of ((4|) is given by 

max d{\) := -y T X\- SWDXh 

A L (6) 

s.t. II^XAHoo < 1. 

Now we are ready to state a convergence result for the ADM, whose proof can be found in [3]. 

Proposition 2.1. Suppose that the solution set of ([2]) is nonempty and fi > 0. Let {(z k , (3 k , X k )} 
be a sequence generated from the above alternating direction method. Then {(z k ,X k )} is convergent. 
Furthermore, the limit of {X k } solves ([6]), and any accumulation point of {f3 k } solves ([2]). 

It is easy to observe that the first subproblem in ([5]) has a closed form solution, which is given 

by: 



£.11 

z = argmm 



X T X(3 k -X T y + — 



mm{ max { X T Xfi k - X T y +'—,-5d}. drl 



where d is the vector consisting of the diagonal entries of D. However, the second subproblem 
does not in general have a closed form solution. In practice we can choose /3 fc+1 to be a suitable 
approximate solution instead. Our next proposition states that the resulting ADM still converges to 
optimal solutions. The proof follows essentially the same arguments as |1UI Theorem 8] and is thus 
omitted. 

Proposition 2.2. Suppose that the solution set of ([2]) is nonempty and \i > 0. Let {fk} be a 
sequence of nonnegative numbers with Y^^k < oo. Let {(z ,X )} be generated as in ([5]) while {/3 k } 
is chosen to satisfy: 



inf{||/3 fc -/3|| 2 : E Argmin L„(** 0, A^ 1 ))} < 



for all k. Then {(z k ,X k )} is convergent. Furthermore, the limit of {X k } solves ©, and any accu- 
mulation point of {/3 k } solves ([5]). 

Before ending this section, we present an iterative algorithm to solve the second subproblem in 
(|5|) approximately. Note that this subproblem can be equivalently written as 

2 

+ (7) 



. A* 
mm — 

2 



X T Xf3 - X T y - z k+1 + A " 
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Since the objective function of (J7J) is the sum of a smooth function and the nonsmooth convex 
function £i-norm, the nonmonotone gradient method II recently proposed by Lu and Zhang [17] can 
be suitably applied to approximately solve ([7]). For ease of reference, we present the algorithm below. 
To simplify notations, for any vector v and any real number 7 > 0, we define 

SoftThresh(u, 7) := sgn(-u) o max{0, \v\ — je} . 

Nonmonotone gradient method: 

1. Start: Choose parameters < rj, a < 1, < a < 1 and integer M > 0. Let u° be given and 
set ceo = 1- 

2. For I = 0, 1, . . . 

(a) Let 

d l = SoftThresh (u l - aW/fc(A a/) - u\ A z = Vf k {u l ) T d l + + d% - \\u l \\x. 

(b) Find the largest a G {1, 77, 77 , ...} such that 

fk(u l + ad 1 ) + \\u l + ad l \\i < max + + aaAi. 

[l-M]+<i<l 11 11 j 

Set cti <— q, u l+1 ^— u' + a/(i' and Z Z + 1. 

(c) Update ai + \ = min |max j ^1 ; , a| where = — and (7' = Vfk(u l+1 ) — 
V/ fc (n'). 

End (for) 



3 Numerical results 

In this section, we conduct numerical experiments to test the performance of the ADM for solving 
problem ([2]). In particular, we compare our method with the default first-order method implemented 
in the TFOCS package |3j for ([2]). All codes are written in Matlab and all experiments are performed 
in Matlab 7.11.0 (2010b) on a workstation with an Intel Xeon E5410 CPU (2.33 GHz) and 8GB 
RAM running Red Hat Enterprise Linux (kernel 2.6.18). 

We initialize the ADM by setting /3 = A = 0, and terminate the method once 

/ 111/3*11! -rf(A fe )| \\D- 1 X T (X^-b)\\ 00 -5 HA^A^IU-l ) 
maX | max{||/3*||i,l} ' max{||/3 fc || 2 ,l} ' max{||A fc || 2 , 1} J " ° 

for some tol > 0. For the nonmonotone gradient method subroutine used to compute /3 fc+1 , we set 
77 = 0.5, a = le — 4, a = le — 8 and M = 1, and moreover, we initialize the method by setting 
v? = (3 k . In addition, we terminate this subroutine once 

1 



max{/ fc (u / ) + H^Hi, 1} 



SoftThresh (u l - Vf k (u l ),e 



u l 



< O.ltol 

2 
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for the same tol as above. 

For the first-order method implemented in the TFOCS package [3] for ([2]), we set the restarting 
parameter to be 200 as discussed in [3l Section 6.1]. We experiment with two different smoothing 
parameters: 0.1 (ATI) and 0.01 (AT2). We terminate the first-order method when 



2 < le - 4. 



max{ || 2) 1} 
3.1 Design matrix with unit column norms 

In this subsection, we consider design matrices with unit column norms. Similar to [H Section 4.1], 
we first generate an n x p matrix X with independent Gaussian entries and then normalize each 
column to have norm 1. We then select a support set T of size \T\ = s uniformly at random, and 
sample a vector f3 on T with i.i.d. entries according to the model = £j(l + |aj|) for all i, where 
& = ±1 with probability 0.5 and a* ~ N(0, 1). We finally set y = X/3 + e with e ~ N(0, a 2 1). 

In our experiment, we choose a = 0.01, 0.05, which corresponds to 1% and 5% noise, and 
(n,p,s) = (720i, 2560i, 80i) for i = 1, 10. For each (n,p, s), we randomly generate 10 copies of 
instances as described above. We then set 6 = \Jl log(p) cr as suggested by [8, Theorem 1.1]. In 
addition, we set p = 10/(y/p5) and to/ = le — 3 for the ADM. Given an approximate solution (3 of 
([2]), we compute a two-stage Dantzig selector (5 by following the same procedure as described in [HJ 
Section 1.6], where we truncate all entries with magnitude below 2a. We evaluate the quality of the 
solutions obtained from different methods by comparing the following ratios that are introduced in 
[HJ Section 4.1]: 

2 = nuvi-h? 2 = s; = i(a-a) 2 ™ 

For convenience, we call them the pre-processing and post-processing errors, respectively. Clearly, 
the smaller the ratios, the higher the solution quality. 

The results of this experiment are reported in Tables Q] and [2j In particular, we present the CPU 
time (cpu), the number of iterations (iter) and the errors p 2 vig and p 2 for all methods, averaged 
over the 10 instances. We see from both tables that our ADM generally outperforms the first-order 
methods implemented in the TFOCS package [3] in terms of both CPU time and solution quality. For 
example, comparing with AT2, which produces solutions with the best quality among the first-order 
methods, our method is about twice as fast and produces solutions with smaller pre-processing errors 
and comparable post-processing errors. 

In FigureHJ we present the result for one instance with size (re, p, s) = (720, 2560, 80) and a = 0.05. 
The asterisks are the true values of (3 while the circles are the estimates obtained by our method 
before the post-processing (the upper plot) and after the post-processing (the lower plot). We see 
from the plot that the latter estimates are very close to the true values of (3. The similar phenomenon 
can also be observed in Figure [2] for the estimates obtained by AT2 on the same instance. 
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Table 1: Results for unit-column-normed X, a — 0.01 





size 






iter 






cpu 






2/2 \ 
P (Porig) 




n 


P 


s 


ADM 


ATI 


AT 2 


ADM 


ATI 


AT 2 


ADM 


ATI 


AT 2 


720 


2560 


80 


13 


601 


562 


2.2 


7.1 


6.6 


1.8(49.2) 


2.2(88.0) 


1.9(74.6) 


1440 


5120 


160 


12 


601 


602 


11.9 


32.3 


32.6 


1.6(58.2) 


1.8(82.6) 


1.4(64.3) 


2160 


7680 


240 


27 


601 


603 


35.0 


67.1 


67.7 


1.5(47.7) 


1.9(83.1) 


1.5(65.7) 


2880 


10240 


320 


28 


601 


602 


59.3 


114.7 


115.3 


1.5(52.7) 


2.0(95.6) 


1.7(70.8) 


3600 


12800 


400 


26 


601 


643 


81.2 


175.3 


188.2 


1.5(55.0) 


1.9(92.2) 


1.6(69.0) 


4320 


15360 


480 


28 


601 


602 


119.5 


250.3 


251.7 


1.6(56.2) 


2.0(95.2) 


1.6(68.7) 


5040 


17920 


560 


28 


601 


622 


157.4 


338.0 


351.1 


1.6(58.9) 


2.0(97.9) 


1.5(69.8) 


5760 


20480 


640 


32 


601 


622 


206.9 


437.7 


455.0 


1.5(58.0) 


1.9(95.4) 


1.6(70.2) 


6480 


23040 


720 


37 


601 


642 


268.3 


552.0 


592.1 


1.5(57.6) 


1.8(93.1) 


1.5(70.4) 


7200 


25600 


800 


39 


601 


622 


334.6 


681.4 


706.7 


1.5(57.5) 


1.9(96.5) 


1.5(70.3) 








Table 2: Results for unit-col 


umn-normed X, a = 0.05 








size 






iter 






cpu 






2/2 \ 
P (Porig) 




n 


P 


s 


ADM 


ATI 


AT 2 


ADM 


ATI 


AT 2 


ADM 


ATI 


AT 2 


720 


2560 


80 


60 


337 


601 


3.8 


4.0 


7.2 


1.4(36.0) 


1.7(57.1) 


1.4(37.5) 


1440 


5120 


160 


50 


340 


602 


19.6 


18.1 


32.1 


1.4(43.7) 


1.7(67.2) 


1.4(44.3) 


2160 


7680 


240 


39 


337 


602 


33.7 


37.6 


67.2 


1.4(45.7) 


1.7(67.7) 


1.4(46.7) 


2880 


10240 


320 


49 


374 


601 


64.6 


71.6 


115.0 


1.5(50.7) 


1.8(75.3) 


1.4(52.2) 


3600 


12800 


400 


52 


342 


601 


96.4 


100.7 


177.0 


1.4(49.9) 


1.8(75.5) 


1.4(51.3) 


4320 


15360 


480 


56 


351 


601 


131.8 


146.3 


251.2 


1.4(48.9) 


1.7(72.7) 


1.4(50.6) 


5040 


17920 


560 


57 


346 


601 


170.5 


194.4 


337.7 


1.5(53.1) 


1.8(78.2) 


1.4(54.3) 


5760 


20480 


640 


60 


344 


602 


207.5 


251.9 


440.5 


1.4(50.9) 


1.7(73.5) 


1.4(51.5) 


6480 


23040 


720 


60 


339 


602 


251.5 


312.0 


554.5 


1.4(49.9) 


1.7(74.6) 


1.4(51.6) 


7200 


25600 


800 


64 


345 


602 


309.3 


390.9 


683.2 


1.5(53.1) 


1.7(79.2) 


1.4(54.9) 



3.2 Design matrix with orthogonal rows 

In this subsection, we consider design matrices with orthogonal rows. We first generate an n x p 
matrix Y with independent Gaussian entries and then set X to be the matrix whose rows form 
an orthogonal basis of the row space of Y. The vector y is then generated similarly as in the 
previous subsection. In particular, we choose a = 0.01, 0.05, which corresponds to 1% and 5% noise, 
(n,p,s) = (720i, 2560i, 80i) for i = 1, 10, and 5 = y/2 log(p) a. For each (n,p,s), we randomly 
generate 10 copies of instances. We set [i = \jb and tol = 2e — 4 for the ADM. 

The computational results averaged over 10 instances are reported in Tables [3] and HI For a = 0.01, 
we observe from Table [3] that our method generally outperforms the first-order methods implemented 
in the TFOCS package |3j in terms of both CPU time and solution quality. In particular, comparing 
with AT2, which produces solutions with the best quality among the first-order methods, our method 
is at least three times faster and produces solutions with smaller pre-processing errors and comparable 
post-processing errors. On the other hand, for a = 0.05, we see from Tabled that our ADM usually 
outperforms the first-order methods in terms of solution quality. In addition, our method is faster 
than AT2 which produces solutions with the best quality among the first-order methods. 

In FigureEl we present the result for one instance with size (n, p, s) = (720, 2560, 80) and a = 0.05. 
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Figure 1: Recovery result for unit-column-normed X from ADM, a = 0.05 
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Table 3: Results for orthogonal X, a — 0.01 











iter 






cpu 






P 2 (Porig) 




n 


P 


s 


ADM 


ATI 


AT 2 


ADM 


ATI 


AT 2 


ADM 


ATI 


AT 2 


720 


2560 


80 


32 


394 


602 


1.3 


4.8 


7.2 


5.0(84.2) 


5.6(105.0) 


5.1(90.8) 


1440 


5120 


160 


28 


362 


602 


7.5 


19.4 


32.3 


4.3(100.1) 


5.2(140.7) 


4.5(111.5) 


2160 


7680 


240 


25 


388 


602 


15.0 


43.8 


68.0 


5.0(102.0) 


5.8(141.2) 


4.9(110.5) 


2880 


10240 


320 


26 


398 


601 


26.8 


76.6 


116.1 


5.1(113.7) 


5.8(158.5) 


5.1(124.8) 


3600 


12800 


400 


25 


389 


602 


38.3 


115.0 


177.8 


4.9(110.0) 


5.5(153.2) 


4.8(119.9) 


4320 


15360 


480 


26 


388 


601 


55.7 


162.8 


251.7 


5.0(110.5) 


5.6(151.9) 


4.8(118.8) 


5040 


17920 


560 


25 


389 


602 


72.2 


221.0 


341.6 


4.6(112.9) 


5.3(158.3) 


4.6(122.0) 


5760 


20480 


640 


25 


384 


602 


93.3 


283.1 


443.4 


4.9(114.5) 


5.5(158.3) 


4.9(125.8) 


6480 


23040 


720 


23 


394 


601 


111.5 


365.7 


557.3 


4.9(117.5) 


5.5(159.5) 


4.8(126.5) 


7200 


25600 


800 


24 


392 


602 


139.9 


446.9 


686.6 


4.8(116.8) 


5.5(162.3) 


4.8(123.6) 



The asterisks are the true values of f3 while the circles are the estimates obtained by our method 
before the post-processing (the upper plot) and after the post-processing (the lower plot). We see 
from the plot that the latter estimates are very close to the true values of f3. The similar phenomenon 
can also be observed in Figure H] for the estimates obtained by AT2 on the same instance. 
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Figure 2: Recovery result for unit-column-normed X from AT2, a = 0.05 

Dantzig selector by AT2 O estimate 



1 

\s .* lag 


6 


6 


#® * * 9 


i i 




500 


1000 1500 


2000 2500 




After post-processing 


O post-process 


i i i i i 
% ® ® 


1 1 1 1 1 



-5 



0< 



500 



1000 



1500 



2000 



2500 



Table 4: Results for orthogonal X, a — 0.05 



n 


size 
P 


s 


ADM 


iter 
ATI 


AT 2 


ADM 


cpu 
ATI 


AT 2 


ADM 


P 2 (Porig) 

ATI 


AT 2 


720 


2560 


80 


165 


227 


440 


3.3 


2.7 


5.1 


4.9(88.9) 


5.7(103.2) 


4.9(92.0) 


1440 


5120 


160 


149 


225 


408 


20.5 


12.3 


22.2 


5.1(97.4) 


5.9(114.1) 


5.5(101.3) 


2160 


7680 


240 


137 


217 


408 


40.3 


24.8 


46.5 


5.1(98.4) 


5.5(112.3) 


5.1(101.1) 


2880 


10240 


320 


129 


220 


405 


65.7 


42.7 


78.0 


4.9(108.4) 


5.5(125.1) 


5.0(112.1) 


3600 


12800 


400 


145 


219 


406 


110.3 


65.1 


120.2 


5.4(108.3) 


6.2(126.4) 


5.7(111.5) 


4320 


15360 


480 


132 


218 


405 


140.6 


92.2 


170.9 


5.1(107.4) 


5.6(121.4) 


5.3(110.7) 


5040 


17920 


560 


125 


211 


409 


182.3 


121.3 


233.1 


4.9(108.6) 
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Figure 3: Recovery result for orthogonal X, a — 0.05 
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